Abstract: Deep learning with a large number of parameters 
requires distributed training, where model accuracy and runtime 
are two important factors to be considered. However, there has 
been no systematic study of the tradeoff between these two factors 
during the model training process. This paper presents Rudra, a 
parameter server based distributed computing framework tuned 
for training large-scale deep neural networks. Using variants of 
the asynchronous stochastic gradient descent algorithm we study 
the impact of synchronization protocol, stale gradient updates, 
minibatch size, learning rates, and number of learners on runtime 
performance and model accuracy. We introduce a new learning 
rate modulation strategy to counter the effect of stale gradients 
and propose a new synchronization protocol that can effectively 
bound the staleness in gradients, improve runtime performance 
and achieve good model accuracy. Our empirical investigation 
reveals a principled approach for distributed training of neural 
networks: the mini-batch size per learner should be reduced 
as more learners are added to the system to preserve the model 
accuracy. We validate this approach using commonly-used image 
classification benchmarks: CIFAR10 and ImageNet. 

I. Introduction 

Deep neural network based models have achieved unparalleled
accuracy in cognitive tasks such as speech recognition,
object detection, and natural language processing. For
certain image classification benchmarks, deep neural networks
have been touted to even surpass human-level performance.
Such accomplishments are made possible by
the ability to perform fast, supervised training of complex
neural network architectures using large quantities of labeled
data. Training a deep neural network translates into solving a
non-convex optimization problem in a very high dimensional
space, and in the absence of a solid theoretical framework
to solve such problems, practitioners are forced to rely on
trial-and-error empirical observations to design heuristics that
help obtain a well-trained model. Naturally, fast training of
deep neural network models can enable rapid evaluation of
different network architectures and facilitate a more thorough
hyper-parameter optimization for these models. Recent years
have seen a resurgence of interest in deploying large-scale
computing infrastructure designed specifically for training
deep neural networks. Some notable efforts in this direction
include distributed computing infrastructure using thousands
of CPU cores [3], [6], high-end graphics processors (GPUs) [16],
or a combination of CPUs and GPUs.

The large-scale deep learning problem can hence be viewed
as a confluence of elements from machine learning (ML) and
high-performance computing (HPC). Much of the work in the
ML community is focused on non-convex optimization, model
selection, and hyper-parameter tuning to improve the neural
network’s performance (measured as classification accuracy)
while working under the constraints of the computational
resources available in a single computing node (CPU with
or without GPU acceleration). From an HPC perspective, prior
work has addressed, to some extent, the problem of accelerating
the neural network training by mapping the asynchronous
version of the mini-batch stochastic gradient descent
(SGD) algorithm onto multiple computing nodes. Contrary
to the popular belief that asynchrony necessarily improves
model accuracy, we find that adopting the approach of scale-out
deep learning using asynchronous SGD gives rise to
complex interdependencies between the training algorithm’s
hyperparameters and the distributed implementation’s design
choices (synchronization protocol, number of learners),
ultimately impacting the neural network’s accuracy and the overall
system’s runtime performance.

In this paper we present Rudra, a parameter server based 
deep learning framework to study these interdependencies 
and undertake an empirical evaluation with public image 
classification benchmarks. Our key contributions are: 

1) A systematic technique (vector clock) for quantifying the 
staleness of gradient descent parameter updates. 

2) An investigation of the impact of the interdependence
of training algorithm’s hyperparameters (mini-batch size,
learning rate (gradient descent step size)) and distributed
implementation’s parameters (gradient staleness, number
of learners) on the neural network’s classification accuracy
and training time.

3) A new learning rate tuning strategy that reduces the effect 
of stale parameter updates. 

4) A new synchronization protocol to reduce network bandwidth
overheads while achieving good classification accuracy
and runtime performance.

5) An observation that to maintain a given level of model 
accuracy, it is necessary to reduce the mini-batch size as 
the number of learners is increased. This suggests a hard 
limit on the amount of parallelism that can be exploited 
in training a given model. 

II. Background 

A neural network computes a parametric, non-linear transformation
$f_\theta : X \mapsto Y$, where $\theta$ represents a set of adjustable
parameters (or weights). In a supervised learning context
(such as image classification), $X$ is the input image and $Y$
corresponds to the label assigned to the image. A deep neural
network organizes the parameters $\theta$ into multiple layers, each
of which consists of a linear transformation followed by a
non-linear function such as sigmoid, tanh, etc. In a feed-forward
deep neural network, the layers are arranged hierarchically
such that the output of layer $l - 1$ feeds into the input
of layer $l$. The terminal layer generates the network’s output
$\hat{Y} = f_\theta(X)$, corresponding to the input $X$.

A neural network training algorithm seeks to find a set
of parameters $\theta^*$ that minimizes the discrepancy between $\hat{Y}$
and the ground truth $Y$. This is usually accomplished by
defining a differentiable cost function $C(\hat{Y}, Y)$ and iteratively
updating each of the model parameters using some variant of
the gradient descent algorithm:

$$\theta(t+1) = \theta(t) - \alpha \, \frac{1}{m} \sum_{s=1}^{m} \frac{\partial C(\hat{Y}_s, Y_s)}{\partial \theta(t)} \qquad (1)$$

where $\theta(t)$ represents the parameters at iteration $t$, $\alpha$
is the step size (also known as the learning rate) and $m$ is
the batch size. The batch gradient descent algorithm sets $m$
to be equal to the total number of training examples $N$. Due
to the large amount of training data, deep neural networks
are typically trained using Stochastic Gradient Descent
(SGD), where the parameters are updated with a randomly
selected training example $(X_s, Y_s)$. The performance of SGD
can be improved by computing the gradients using a mini-batch
containing $m = \mu \ll N$ training examples.
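
The mini-batch update described above can be illustrated on a toy problem. The following is a minimal sketch, not part of Rudra: it assumes a single scalar parameter, the squared-error cost $C = \frac{1}{2}(\theta x - y)^2$, and synthetic data generated by $y = 3x$. The function names, the data, and the hyperparameter values are all hypothetical.

```python
import random


def minibatch_sgd_step(theta, batch, alpha):
    """Apply one mini-batch update to a scalar parameter.

    Assumed cost per example: C = 0.5 * (theta * x - y) ** 2,
    whose derivative with respect to theta is (theta * x - y) * x.
    """
    m = len(batch)
    grad_sum = sum((theta * x - y) * x for x, y in batch)
    return theta - alpha * grad_sum / m


def train(data, mu, alpha, steps, seed=0):
    """Run `steps` updates, each on a randomly sampled mini-batch of size mu."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        batch = rng.sample(data, mu)
        theta = minibatch_sgd_step(theta, batch, alpha)
    return theta


# Hypothetical data set: y = 3x for x = 1..10, so the optimum is theta = 3.
data = [(float(x), 3.0 * x) for x in range(1, 11)]
theta_fit = train(data, mu=4, alpha=0.01, steps=500)
```

Setting `mu` equal to `len(data)` turns the same routine into batch gradient descent, and `mu=1` gives plain SGD.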

Deep neural networks are generally considered hard to
train, and the trained model’s generalization error
depends strongly on hyperparameters such as the initializations,
learning rates, mini-batch size, network architecture,
etc. In addition, neural networks are prone to overfit the data.
Regularization methods (e.g., weight decay and dropout)
applied during training have been shown to combat overfitting
and reduce the generalization error.

Scale-out deep learning: A typical implementation of distributed
training of deep neural networks consists of a master
(parameter server) that orchestrates the work among one
or more slaves (learners). Each learner does the following:

1) getMinibatch: Select randomly a mini-batch of examples
from the training data.

2) pullWeights: Request the parameter server for the
current set of weights/parameters.

3) calcGradient: Compute gradients based on the training
error for the current mini-batch (Equation (1)).

4) pushGradient: Send the computed gradients to the
parameter server.

The parameter server maintains a global view of the model
weights and performs the following functions:

1) sumGradients: Receive and accumulate the gradients
from the learners.

2) applyUpdate: Multiply the accumulated gradient by
the learning rate and update the weights (Equation (1)).

Learners exploit data parallelism by each maintaining a
copy of the entire model, and training independently over a
unique mini-batch. The model parallelism approach augments
this framework by splitting the neural network model across
multiple learners. With model parallelism, each learner trains
only a portion of the network; edges that cross learner boundaries
must be synchronized before gradients can be computed
for the entire model.
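
The learner and parameter server steps listed above can be sketched in a few lines. This is a single-process illustration under stated assumptions, not Rudra's implementation: the class and method names merely mirror the step names in the text, the model is one scalar weight fitted to hypothetical data $y = 2x$, and the server averages the accumulated gradients before applying the learning rate.

```python
import random


class ParameterServer:
    """Holds the global weight; accumulates gradients and applies updates."""

    def __init__(self, weight, alpha):
        self.weight = weight
        self.alpha = alpha
        self.grad_sum = 0.0
        self.grad_count = 0

    def pull_weights(self):
        return self.weight

    def sum_gradients(self, grad):
        # sumGradients: receive and accumulate a gradient from a learner.
        self.grad_sum += grad
        self.grad_count += 1

    def apply_update(self):
        # applyUpdate: scale the averaged gradient by the learning rate
        # (averaging is an assumption of this sketch).
        self.weight -= self.alpha * self.grad_sum / self.grad_count
        self.grad_sum, self.grad_count = 0.0, 0


class Learner:
    """Computes gradients of 0.5 * (w * x - y) ** 2 over random mini-batches."""

    def __init__(self, data, mu, seed):
        self.data = data
        self.mu = mu
        self.rng = random.Random(seed)

    def get_minibatch(self):
        return self.rng.sample(self.data, self.mu)

    def calc_gradient(self, weight, batch):
        return sum((weight * x - y) * x for x, y in batch) / len(batch)


def run_round(server, learners):
    """Every learner contributes one gradient, then the server updates once."""
    for learner in learners:
        batch = learner.get_minibatch()              # getMinibatch
        weight = server.pull_weights()               # pullWeights
        grad = learner.calc_gradient(weight, batch)  # calcGradient
        server.sum_gradients(grad)                   # pushGradient
    server.apply_update()


# Hypothetical setup: 4 learners, mini-batch size 2, data y = 2x.
data = [(float(x), 2.0 * x) for x in range(1, 9)]
server = ParameterServer(weight=0.0, alpha=0.01)
learners = [Learner(data, mu=2, seed=s) for s in range(4)]
for _ in range(1000):
    run_round(server, learners)
```

Because every learner pulls the same weights within a round and the server updates only after all gradients have arrived, this loop corresponds to a fully synchronous schedule; the asynchronous variants relax exactly that constraint.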

Several different synchronization strategies are possible. 
The most commonly used one is the asynchronous protocol, 
in which the learners work completely independently of each 
other and the parameter server. Section III will discuss three 
different synchronization strategies, each associated with a 
unique tradeoff between model accuracy and runtime. 


III. Design and Implementation 
A. Terminology 

Throughout the paper, we use the following definitions: 

• Parameter Server: a server that holds the model weights.
A typical parameter server uses a distributed key-value store
to synchronize state between processes. The parameter server
collects gradients from learners and updates the weights
accordingly.

• Learner: A computing process that can calculate weight 
updates (gradients). 

• $\mu$: mini-batch size.

• $\alpha$: learning rate.

• $\lambda$: number of learners.

• Epoch: a pass through the entire training dataset. 

• Timestamp: we use a scalar clock to represent the
weights timestamp $ts_i$, starting from $i = 0$. Each weight
update increments the timestamp by 1. The timestamp of
a gradient is the same as the timestamp of the weights
used to compute the gradient.

• $\sigma$: staleness of the gradient. A gradient with timestamp
$ts_i$ is pushed to the parameter server with current weight
timestamp $ts_j$, where $ts_j \ge ts_i$. We define the staleness
of this gradient $\sigma$ as $j - i$.

• $\langle\sigma\rangle$: average staleness of gradients. The timestamps
of the set of $n$ gradients that triggers the advancement
of the weights timestamp from $ts_{i-1}$ to $ts_i$
form a vector clock $\langle ts_{i_1}, ts_{i_2}, \ldots, ts_{i_n} \rangle$, where
$\max\{i_1, i_2, \ldots, i_n\} < i$. The average staleness of gradients
$\langle\sigma\rangle$ is defined as:

$$\langle\sigma\rangle = (i - 1) - \mathrm{mean}\{i_1, i_2, \ldots, i_n\} \qquad (2)$$


• Hardsync protocol: To advance the weights timestamp from
$ts_i$ to $ts_{i+1}$, each learner calculates exactly one mini-batch
and sends its gradient $\nabla\theta_l$ to the parameter server.
The parameter server averages the gradients and updates
the weights according to Equation (1), then broadcasts
the new weights to all learners. Staleness in the hardsync
protocol is always zero.
• Async protocol: Each learner calculates the gradients
and asynchronously pushes/pulls the gradients/weights
to/from the parameter server. The Async weight update rule
is given by:

$$\nabla\theta(i) = \nabla\theta_l, \quad L_l \in \{L_1, \ldots, L_\lambda\}$$
$$\theta(i+1) = \theta(i) - \alpha \, \nabla\theta(i)$$

Gradient staleness may be hard to control due to the
asynchrony in the system. Downpour SGD is
an implementation of the Async protocol for a commodity
scale-out system in which the staleness can be as large
as hundreds.

Fig. 1. Rudra-base architecture

Fig. 2. Rudra-adv architecture

• n-softsync protocol: Each learner pulls the weights from
the parameter server, calculates the gradients and pushes
the gradients to the parameter server. The parameter
server updates the weights after collecting at least
$c = \lfloor \lambda/n \rfloor$ gradients. The splitting parameter $n$ can vary
from 1 to $\lambda$. The n-softsync weight update rule is given
by:

$$c = \lfloor \lambda/n \rfloor$$
$$\nabla\theta(i) = \frac{1}{c} \sum_{l=1}^{c} \nabla\theta_l, \quad L_l \in \{L_1, \ldots, L_\lambda\}$$
$$\theta(i+1) = \theta(i) - \alpha \, \nabla\theta(i) \qquad (5)$$

In Section V-A we will show that in a homogeneous
cluster where each learner proceeds at roughly the same
speed, the staleness of the model can be empirically
bounded at $2n$. Note that when $n$ is equal to $\lambda$, the
weight update rule at the parameter server is exactly
the same as in the Async protocol.
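
To make the timestamp, vector clock and average staleness definitions concrete, the sketch below simulates the n-softsync bookkeeping for an idealized cluster. It assumes a strictly round-robin schedule in which every learner pushes a gradient and then immediately pulls the current weights; this schedule is an assumption of the example and not a description of Rudra's runtime behaviour, where timing jitter produces a broader staleness distribution. No gradients are actually computed, only timestamps are tracked, and the function name is hypothetical.

```python
def simulate_softsync(num_learners, n, rounds):
    """Return the average staleness of each weight update under n-softsync.

    Learners take turns in a fixed order (idealized homogeneous cluster).
    The server updates the weights after collecting c = floor(lambda / n)
    gradients and records the average staleness as in Equation (2).
    """
    c = num_learners // n              # gradients collected per update
    server_ts = 0                      # current weights timestamp
    learner_ts = [0] * num_learners    # timestamp of each learner's weights
    pending = []                       # vector clock of collected gradients
    avg_staleness = []
    for _ in range(rounds):
        for l in range(num_learners):
            pending.append(learner_ts[l])   # pushGradient
            if len(pending) == c:
                # Equation (2): here server_ts plays the role of (i - 1).
                avg_staleness.append(server_ts - sum(pending) / c)
                server_ts += 1
                pending = []
            learner_ts[l] = server_ts       # pullWeights
    return avg_staleness


staleness_1 = simulate_softsync(num_learners=30, n=1, rounds=10)
staleness_2 = simulate_softsync(num_learners=30, n=2, rounds=10)
staleness_30 = simulate_softsync(num_learners=30, n=30, rounds=10)
```

Under this idealized schedule the average staleness settles at 29/30 for 1-softsync, 29/15 for 2-softsync and 29 for 30-softsync, that is, slightly below $n$ in each case.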

B. Rudra-base System Architecture 

Figure 1 illustrates the parameter server design that we use
to study the interplay of hyperparameter tuning and system
scale-out factor. This system implements both the hardsync and
n-softsync protocols. The arrows between each entity represent
a (group of) MPI message(s), except the communication
between Learner and Data Server, which is achieved by a
global file system. We describe each entity’s role and its
implementation below.

Learner is a single-process multithreaded SGD solver. Before
training each mini-batch, a learner pulls the weights and
the corresponding timestamp from the parameter server. A
learner reduces the pullWeights traffic by first inquiring
the timestamp from the parameter server: if the server’s
timestamp is the same as that of the local weights, the learner
does not pull the weights. After training the mini-batch, the
learner sends the gradients along with the gradients’ timestamp
to the parameter server. Each pull or push message has the size
of the model plus one scalar timestamp.

Data Server is hosted on IBM GPFS, a global file system.
Each learner has an I/O thread, which prefetches the mini-batch
via random sampling prior to training. Prefetching is
completely overlapped with the computation.

Parameter Server is a multithreaded process that accumulates
gradients from each learner and applies the update rules
defined in Section III-A. In this study, we implemented the
hardsync protocol and the n-softsync protocol. The learning rate
is configured differently in each protocol. In the hardsync
protocol, the learning rate is multiplied by a factor
$\sqrt{\lambda\mu/B}$, where $B$ is the batch size of the reference
model. In the n-softsync protocol, the learning rate is multiplied
by the reciprocal of the average staleness. We demonstrate in
Section V-A that this treatment of the learning rate
in n-softsync can significantly improve the model accuracy.
The parameter server records the vector clock of each weight
update to keep track of the average staleness. When a
specified number of epochs have been trained, the parameter
server shuts down each learner.

Statistics Server is a multithreaded process that receives the 
training error from each learner and receives the model from 
the parameter server at the end of each epoch and tests the 
model. It monitors the model training quality. 

This architecture is non-blocking everywhere except for
pushing up gradients and pushing down weights, which are
blocking MPI calls (e.g. MPI_Send). The parameter server handles
each incoming message one by one (the message handling
itself is multithreaded). In this way, we can precisely control
how each learner’s gradients are received and handled by the
parameter server. The purpose of Rudra-base is to control the
noise of the system, so that we can study the interplay of
scale-out factor and the hyperparameter selection. For a
moderately-sized dataset like CIFAR-10, Rudra-base shows good
scale-out factor (see Section V).

C. Rudra-adv and Rudra-adv* System Architecture 

To achieve high classification accuracy, the required model
size may be quite large (e.g. hundreds of MBs). In many cases,
to achieve the best possible model accuracy, the mini-batch size
$\mu$ must be small, as we will demonstrate in Section V. In order
to meet these requirements with acceptable performance, we
implemented Rudra-adv and Rudra-adv*.





























Rudra-adv system architecture. Rudra-base clearly is not
a scalable solution when the model gets large. Under ideal
circumstances (see Section IV-A for our experimental hardware
system specification), a single learner pushing a model
of 300 MB (size of a typical deep neural network, see
Section IV-B) would take more than 10 ms to transfer this data. If
16 tasks are sending 300 MB to the same receiver and there is
link contention, it would take over a second for the messages
to be delivered.

To alleviate the network traffic to the parameter server, we
build a parameter server group that forms a tree. We co-locate
each tree leaf node on the same node as the learners
for which it is responsible. Each node in the parameter server
group is responsible for averaging the gradients sent from its
learners and relaying the averaged gradient to its parent. The
root node in the parameter server group is responsible for
applying the weight update and broadcasting the updated weights.
Each non-leaf node pulls the weights from its parent and
responds to its children’s weight pulling requests. Rudra-adv
can significantly improve performance compared to Rudra-base
and manages to scale out to large models and small $\mu$, while
maintaining control of the gradients’ staleness. Figure 2
illustrates the system architecture for Rudra-adv. Red boxes
represent the parameter server group, in which the gradients
are pushed and aggregated upwards. Green boxes represent
learners; each learner pushes the gradient to its parameter
server parent and receives weights from its parameter server
parent. The key difference between Rudra-adv and a sharded
parameter server design (e.g., DistBelief and Adam)
is that the weights maintained in Rudra-adv have the same
timestamps, whereas sharded parameter servers maintain the
weights with different timestamps. Having consistent weights
makes the analysis of hyperparameter/scale-out interplay much
more tractable.
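
The upward aggregation in the parameter server tree can be sketched as a recursive reduction. This is an illustration only: the tree is a hypothetical nested list whose leaves are scalar gradients, and, as a design choice of this sketch rather than something stated in the text, each node relays a (sum, count) pair so that the root recovers the exact mean even when nodes serve different numbers of learners. Relaying plain averages gives the same result only when every node has the same fan-out.

```python
def aggregate(node):
    """Return (gradient_sum, gradient_count) for the subtree rooted at node.

    A node is either a float (one learner's gradient) or a list of child
    nodes (a parameter server in the tree).
    """
    if isinstance(node, float):
        return node, 1
    total, count = 0.0, 0
    for child in node:
        child_total, child_count = aggregate(child)
        total += child_total
        count += child_count
    return total, count


def root_update(weight, tree, alpha):
    """Root node: average all learner gradients and apply the weight update."""
    total, count = aggregate(tree)
    return weight - alpha * total / count


# Hypothetical tree: two leaf parameter servers with 2 and 3 learners.
tree = [[1.0, 2.0], [3.0, 4.0, 5.0]]
new_weight = root_update(1.0, tree, alpha=0.1)
```

In this example the mean over all five gradients is 3.0, whereas averaging the two per-node averages (1.5 and 4.0) would give 2.75, which is why the sketch carries the counts along.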

Rudra-adv* system architecture. We built Rudra-adv* to 
further improve the runtime performance in two ways: 

Broadcast weights within learners. To further reduce the 
traffic to the parameter server group, we form a tree within 
all learners and broadcast the weights down this tree. In this 
way the network links to/from learners are also utilized. 

Asynchronous pushGradient and pullWeights. Ideally,
one would use MPI non-blocking send calls to asynchronously
send gradients and weights. However, depending
on the MPI implementation, it is difficult to guarantee
that the non-blocking send calls make progress in the background.
Therefore we open additional communication
threads and use MPI blocking send calls in the threads. Each
learner process runs two additional communication threads:
the pullWeights thread and the pushGradient thread. In
this manner, computing can continue without waiting for the
communication. Note that since we need to control $\mu$ (the
smaller $\mu$ is, the better the model converges, as we demonstrate
in Section V-B), we must guarantee that the learner
pushes each calculated gradient to the server. Alternatively,
one could locally accrue gradients and send the sum;
however, that would effectively increase $\mu$. For this
reason, the pushGradient thread cannot start sending the
current gradient before the previous one has been delivered.
As Table 1 demonstrates, as long as we can optimize
the use of network links, this constraint has no bearing on
the runtime performance, even when $\mu$ is extremely small.
In contrast, the pullWeights thread has no such constraint:
we maintain a computation buffer and a communication
buffer for the pullWeights thread, and the communication
always happens in the background. Using the newly received
weights only requires a pointer swap. Figure 2 illustrates the
system architecture for Rudra-adv*. Different from Rudra-adv,
each learner continuously receives weights from the weights
downcast tree, which consists of the top level parameter server
node and all the learners.

Implementation    Communication overlap (%)
Rudra-base        11.52
Rudra-adv         56.75
Rudra-adv*        99.56

TABLE 1. Communication overlap measured in Rudra-base, Rudra-adv,
and Rudra-adv* for an adversarial scenario, where the mini-batch
size is the smallest possible for 4-way multi-threaded learners,
the model size is 300 MB, and there are about 60 learners.

We measure the communication overlap by calculating the
ratio between computation time and the sum of computation
and communication time. Table 1 records the communication
overlap for Rudra-base, Rudra-adv, and Rudra-adv* in
an adversarial scenario. Rudra-adv* can almost completely
overlap computation with communication. Rudra-adv* can
scale out to very large model sizes and work with the smallest
possible mini-batch size. In Section V-E we demonstrate
Rudra-adv*’s effectiveness in improving runtime performance
while achieving good model accuracy.
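
The computation/communication buffer pair used by the pullWeights thread, and the overlap metric reported in Table 1, can be sketched as follows. This is a hedged illustration using Python threads rather than MPI: the class name, its methods and the use of a lock are assumptions of this example, and `comm_time` is taken here to mean the time a learner spends waiting on communication, which is an assumption about how the measurement was made.

```python
import threading


class DoubleBufferedWeights:
    """Computation buffer plus communication buffer, exchanged by a swap."""

    def __init__(self, initial_weights):
        self._compute = list(initial_weights)  # read by the training thread
        self._comm = list(initial_weights)     # written by the pull thread
        self._fresh = False
        self._lock = threading.Lock()

    def receive(self, new_weights):
        """Called by the communication thread when new weights arrive."""
        with self._lock:
            self._comm[:] = new_weights
            self._fresh = True

    def weights_for_next_minibatch(self):
        """Called by the training thread; swaps only if fresh weights exist."""
        with self._lock:
            if self._fresh:
                # The "pointer swap": exchange references, copy nothing.
                self._compute, self._comm = self._comm, self._compute
                self._fresh = False
            return self._compute


def communication_overlap(compute_time, comm_time):
    """Overlap (%) = computation / (computation + communication) time."""
    return 100.0 * compute_time / (compute_time + comm_time)


buffers = DoubleBufferedWeights([0.0, 0.0])
before = list(buffers.weights_for_next_minibatch())
buffers.receive([1.0, 2.0])
after = list(buffers.weights_for_next_minibatch())
```

The training thread never blocks on a transfer in this scheme; it simply keeps using the previous weights until a complete new copy has landed in the communication buffer.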


IV. Methodology 
A. Hardware and software environment 

We deploy the Rudra distributed deep learning framework
on a P775 supercomputer. Each node of this system contains
four eight-core 3.84 GHz POWER7 processors, one optical
connect controller chip and 128 GB of memory. A single
node has a theoretical floating point peak performance of
982 Gflop/s, memory bandwidth of 512 GB/s and bi-directional
interconnect bandwidth of 192 GB/s.

The cluster operating system is Red Hat Enterprise Linux
6.4. To compile and run Rudra we used the IBM xlC compiler
version 12.1 with the -O3 -q64 -qsmp options, ESSL for
BLAS subroutines, and IBM MPI (IBM Parallel Operating
Environment 1.2).


B. Benchmark datasets and neural network architectures 

To evaluate Rudra’s scale-out performance we employ
two commonly used image classification benchmark datasets:
CIFAR10 and ImageNet. The CIFAR10 dataset
comprises a total of 60,000 RGB images of size 32 × 32
pixels partitioned into the training set (50,000 images) and
the test set (10,000 images). Each image belongs to one of the
10 classes, with 6000 images per class. For this dataset, we
construct a deep convolutional neural network (CNN) with 3
convolutional layers, each followed by a subsampling/pooling
layer. The output of the 3rd pooling layer connects, via a
fully-connected layer, to a 10-way softmax output layer that
generates a probability distribution over the 10 output classes.
This neural network architecture closely mimics the CIFAR10
model (cifar10_full.prototxt) available as a part of the
open-source Caffe deep learning package. The total number of
trainable parameters in this network is ~90 K, resulting
in a model size of ~350 kB when using 32-bit floating
point data representation. The neural network is trained using
momentum-accelerated mini-batch SGD with a batch size of
128 and momentum set to 0.9. As a data preprocessing step,
the per-pixel mean is computed over the entire training dataset
and subtracted from the input to the neural network. The
training is performed for 140 epochs and results in a model
that achieves 17.9% misclassification error rate on the test
dataset. The base learning rate $\alpha_0$ is set to 0.001 and is
reduced by a factor of 10 after the 120th and 130th epochs. This
learning rate schedule proves to be quite essential in obtaining
the low test error of 17.9%.
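
The step-wise learning rate schedule described above can be written as a small helper. This is a sketch under one stated assumption: epochs are counted from 1, and "reduced after the 120th epoch" is read as taking effect from epoch 121 onward. The function name and signature are hypothetical.

```python
def step_learning_rate(epoch, base_lr, drop_epochs, factor=10.0):
    """Learning rate in effect during the given 1-indexed epoch.

    The rate is divided by `factor` once for every entry of `drop_epochs`
    that has already been completed (epoch > entry).
    """
    drops = sum(1 for e in drop_epochs if epoch > e)
    return base_lr / factor ** drops


# CIFAR10 baseline from the text: 0.001, reduced after epochs 120 and 130.
cifar_schedule = [step_learning_rate(e, 0.001, (120, 130)) for e in range(1, 141)]
```

With these settings the first 120 epochs use 0.001, epochs 121 to 130 use 0.0001, and the final 10 epochs use 0.00001.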

Our second benchmark dataset is a collection of natural
images used as a part of the 2012 edition of the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC 2012).
The training set is a subset of the hand-labeled ImageNet
database and contains 1.2 million images. The validation
dataset has 50,000 images. Each image maps to one of
the 1000 non-overlapping object categories. The images are
converted to a fixed resolution of 256×256 to be used as input
to a deep convolutional neural network. For this dataset, we
consider a previously introduced neural network architecture
consisting of 5 convolutional layers and 3 fully-connected
layers. The last layer outputs the probability distribution over
the 1000 object categories. In all, the neural network has
~72 million trainable parameters and the total model size is
289 MB. The network is trained using momentum-accelerated
SGD with a batch size of 256 and momentum set to 0.9.
Similar to the CIFAR10 benchmark, the per-pixel mean computed
over the entire training dataset is subtracted from the input
image feeding into the neural network. To prevent overfitting,
a weight regularization penalty of 0.0005 is applied to all the
layers in the network and a dropout of 50% is applied to the
1st and 2nd fully-connected layers. The initial learning rate
$\alpha_0$ is set equal to 0.01 and reduced by a factor of 10 after the
15th and 25th epochs. Training for 30 epochs results in a top-1
error of 43.95% and top-5 error^1 of 20.55% on the validation
set.


V. Evaluation 

In this section we present results of evaluation of our scale-out
deep learning training implementation. For an initial design
space exploration, we use the CIFAR10 dataset and the Rudra-base
system architecture. Subsequently we extend our findings
to the ImageNet dataset using the Rudra-adv and Rudra-adv*
system architectures.

^1 The top-5 error corresponds to the fraction of samples where the correct
label does not appear in the top-5 labels considered most probable by the
model.

Fig. 3. Average staleness $\langle\sigma\rangle$ of the gradients as a function of the weight
update step at the parameter server when using (a) 1-softsync, 2-softsync and
(b) $\lambda$-softsync protocol. Inset in (b) shows the distribution of the gradient
staleness values for $\lambda$-softsync protocol. Number of learners $\lambda$ is set to 30.

A. Stale gradients 


In the hardsync protocol introduced in Section III-A, the
transition from $\theta(i)$ to $\theta(i+1)$ involves aggregating the
gradients calculated with $\theta(i)$. As a result, each of the gradients
$\nabla\theta_l$ carries with it a staleness $\sigma$ equal to 0. However, departing
from the hardsync protocol towards either the n-softsync or
the Async protocol inevitably adds staleness to the gradients,
as a subset of the learners contribute gradients calculated using
weights with a timestamp earlier than the current timestamp of
the weights at the parameter server.

To measure the effect of gradient staleness when using the
n-softsync protocol, we use the CIFAR10 dataset and train
the neural network described in Section IV-B using $\lambda = 30$
learners. For the 1-softsync protocol, the parameter server
updates the current set of weights when it has received a total
of 30 gradients from the learners. Similarly, the 2-softsync
protocol forces the parameter server to accumulate $\lambda/2 = 15$
gradient contributions from the learners before updating the
weights. As shown in Figure 3(a), the average staleness $\langle\sigma\rangle$ for
the 1-softsync and 2-softsync protocols remains close to 1 and
2, respectively. In the 1-softsync protocol, the staleness $\sigma_{L_l}$ for
the gradients computed by the learner $L_l$ takes values 0, 1, or
2, whereas $\sigma_{L_l} \in \{0, 1, 2, 3, 4\}$ for the 2-softsync protocol.
Figure 3(b) shows the gradient staleness for the n-softsync
protocol where $n = \lambda = 30$. In this case, the parameter server
updates the weights after receiving a gradient from any of the
learners. A large fraction of the gradients have staleness close
to 30, and only with a very low probability (< 0.0001) does $\sigma$
exceed $2n = 60$. These measurements show that, in general,
$\sigma_{L_l} \in \{0, 1, \ldots, 2n\}$ and $\langle\sigma\rangle = n$ for our implementation of
the n-softsync protocol.

Modifying the learning rate for stale gradients: In our experiments
with the n-softsync protocol we found it beneficial,
and at times necessary, to modulate the learning rate $\alpha$ to take
into account the staleness of the gradients. For the n-softsync
protocol, we set the learning rate as:

$$\alpha = \alpha_0 / \langle\sigma\rangle = \alpha_0 / n \qquad (6)$$

where $\alpha_0$ is the learning rate used for the baseline (control)
experiment: $\mu = 128$, $\lambda = 1$. Figure 4 shows a set of
representative results illustrating the benefits of adopting this
learning rate modulation strategy. We show the evolution
of the test error on the CIFAR10 dataset as a function of
the training epoch for two different configurations of the
n-softsync protocol ($n = 4$, $n = 30$) with the number of
learners set to $\lambda = 30$. In both these configurations, setting the
learning rate in accordance with Equation (6) results in lower
test error as compared with the cases where the learning rate is
set to $\alpha_0$. Surprisingly, the configuration 30-softsync, $\lambda = 30$,
$\alpha = \alpha_0$ fails to converge and shows a constant high error rate
of 90%. Reducing the learning rate by a factor $\langle\sigma\rangle = n = 30$
lets the model converge to a much lower test error.^2

Fig. 4. Effect of learning rate modulation strategy: Dividing the learning
rate by the average staleness aids in better convergence and achieves lower
test error when using the n-softsync protocol. Number of learners, $\lambda = 30$;
mini-batch size $\mu = 128$.

Fig. 5. $(\sigma, \mu, \lambda)$ tradeoff curves for the hardsync protocol. The dashed black
line represents the 17.9% test error achieved by the baseline model
$(\sigma, \mu, \lambda) = (0, 128, 1)$ on the CIFAR10 dataset.
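
The learning rate modulation of Equation (6) amounts to a one-line rescaling at the parameter server. The sketch below is illustrative only: it takes $\langle\sigma\rangle = n$ as the text does, uses a scalar weight and hypothetical gradient values, and averages the collected gradients before applying the step. The function names are not Rudra's.

```python
def softsync_learning_rate(alpha0, n):
    """Equation (6): alpha = alpha0 / <sigma>, with <sigma> taken to be n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return alpha0 / n


def softsync_update(weight, gradients, alpha0, n):
    """Apply one n-softsync weight update with the modulated learning rate."""
    alpha = softsync_learning_rate(alpha0, n)
    mean_grad = sum(gradients) / len(gradients)
    return weight - alpha * mean_grad


# Baseline rate from the CIFAR10 setup, modulated for 4- and 30-softsync.
alpha_4 = softsync_learning_rate(0.001, 4)
alpha_30 = softsync_learning_rate(0.001, 30)
# One 30-softsync update from a single (hypothetical) gradient of 0.6.
updated = softsync_update(1.0, [0.6], alpha0=0.001, n=30)
```

For $n = 1$ the helper returns $\alpha_0$ unchanged, so the 1-softsync case coincides with the unmodulated baseline rate.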

B. $(\sigma, \mu, \lambda)$ tradeoff curves 

Hyperparameter optimization plays a central role in obtaining
good accuracy from neural network models. For
the SGD training algorithm, this includes a search over the
neural network’s training parameters such as learning rates,
weight regularization, depth of the network, mini-batch size
etc. in order to improve the quality of the trained neural
network model (quantified as the error on the validation
dataset). Additionally, when distributing the training problem
across multiple learners, parameters such as the number of
learners and the synchronization protocol enforced amongst
the learners impact not only the runtime of the algorithm but
also the quality of the trained model.

An exhaustive search over the space defined by these
parameters for joint optimization of the runtime performance
and the model quality can prove to be a daunting task even
for a small model such as that used for the CIFAR10
dataset, and is clearly intractable for models and datasets the
scale of ImageNet. To develop a better understanding of the
interdependence among the various tunable parameters in the
distributed deep learning problem, we introduce the notion of
$(\sigma, \mu, \lambda)$ tradeoff curves. A $(\sigma, \mu, \lambda)$ tradeoff curve plots the
error on the validation set (or test set) and the total time to
train the model (wall clock time) for different configurations
of average gradient staleness $\langle\sigma\rangle$, mini-batch size per learner
$\mu$, and the number of learners $\lambda$. The configurations $(\sigma, \mu, \lambda)$
that achieve the virtuous combination of low test error and
small training time are preferred and form ideal candidates
for further hyperparameter optimization.

^2 Although not explored as a part of this work, it is certainly possible to
implement a finer-grained learning rate modulation strategy that depends on
the staleness of each of the gradients computed by the learners instead of the
average staleness. Such a strategy should apply smaller learning rates to staler
gradients.

Fig. 6. $(\sigma, \mu, \lambda)$ tradeoff curves for (a) $\lambda$-softsync protocol and (b) 1-softsync
protocol. The shaded region shows the region bounded by the $\mu = 128$, $\lambda = 30$,
and $\mu = 4$ contours for the hardsync protocol. $\lambda \in \{1, 2, 4, 10, 18, 30\}$
and $\mu \in \{4, 8, 16, 32, 64, 128\}$. Note that for $\lambda = 1$, the n-softsync protocol
degenerates to the hardsync protocol.

We run^ the CIFAR10 benchmark for
λ ∈ {1, 2, 4, 10, 18, 30} and μ ∈ {4, 8, 16, 32, 64, 128}.
The accompanying figure shows a set of (σ, μ, λ) curves for the hardsync
protocol, i.e., σ = 0. The baseline configuration with λ = 1
learner and mini-batch size μ = 128 achieves a test error
of 17.9%. With the exception of modifying the learning
rate as α = α₀√(μλ/128), all the other neural network
hyperparameters were kept unchanged from the baseline
configuration while generating the data points for different
values of μ and λ. Note that it is possible to achieve a test
error lower than the baseline by reducing the mini-batch size
from 128 to 4. However, this configuration (indicated on the
plot as (σ, μ, λ) = (0, 4, 1)) increases training time compared
with the baseline. This is primarily because the
dominant computation performed by the learners involves
multiple calls to matrix multiplication (GEMM) to compute
WX, where the samples in a mini-batch form the columns of the
matrix X. Reducing the mini-batch size causes a proportionate
decrease in the GEMM throughput and slower processing of
the mini-batch by the learner.
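The learning rate rule used for these hardsync runs, read here as α = α₀√(μλ/128), can be written as a one-line helper. This is a minimal sketch: the function name and the base rate `alpha0 = 0.001` are placeholders chosen for illustration, not values taken from the paper.

```python
import math


def hardsync_learning_rate(alpha0: float, mu: int, lam: int,
                           ref_batch: int = 128) -> float:
    """Scale the base rate with the square root of the effective batch size.

    mu * lam is the number of samples contributing to one weight update;
    ref_batch is the baseline mini-batch size (128 in the text).
    """
    return alpha0 * math.sqrt(mu * lam / ref_batch)


alpha0 = 0.001  # hypothetical base learning rate
for mu, lam in [(128, 1), (4, 1), (128, 30), (4, 30)]:
    print(mu, lam, hardsync_learning_rate(alpha0, mu, lam))
```

With this rule the baseline (μ, λ) = (128, 1) keeps α₀ unchanged, while (4, 30) lands close to it because μλ = 120 is nearly 128.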

In the figure, the contour labeled μ = 128 connects the configurations
in which the mini-batch size per learner is kept constant at
128 and the number of learners is varied from λ = 1 to λ = 30.

^ The mapping between λ and the number of computing nodes η is (λ, η) ∈
{(1, 1), (2, 1), (4, 2), (10, 4), (18, 4), (30, 8)}.




















The training time reduces monotonically with λ, albeit at the
expense of an increase in the test error. Traversing along the
λ = 30 contour from configuration (σ, μ, λ) = (0, 128, 30) to
(σ, μ, λ) = (0, 4, 30) (i.e., reducing the mini-batch size from
128 to 4) helps restore much of this degradation in the test
error by partially sacrificing the speed-up obtained by
virtue of scaling out to 30 learners.




Fig. 7. Speed-up in the training time for mini-batch size (a) μ = 128 and (b)
μ = 4 for 3 different protocols: hardsync, λ-softsync, and 1-softsync. Speed-up
numbers in (a) and (b) are calculated relative to (σ, μ, λ) = (0, 128, 1)
and (σ, μ, λ) = (0, 4, 1), respectively.


Figure 6(a) shows (σ, μ, λ) tradeoff curves for the
λ-softsync protocol. In this protocol, the parameter server
updates the weights as soon as it receives a gradient from any of
the learners. Therefore, as shown in Section V-A, the average
gradient staleness ⟨σ⟩ = λ and σ_max ≤ 2λ with high probability.
The learning rate is set in accordance with the staleness-dependent
modulation equation given earlier. All
the other hyperparameters are left unchanged from the baseline
configuration. Qualitatively, the (σ, μ, λ) tradeoff curves for
λ-softsync look similar to those observed for the hardsync
protocol. In this case, however, the degradation in the test error
relative to the baseline for the (σ, μ, λ) = (30, 128, 30)
configuration is much more pronounced. As observed previously, this
increase in the test error can largely be mitigated by reducing
the size of the mini-batch processed by each learner (λ = 30
contour). Note, however, the sharp increase in the training
time for the configuration (σ, μ, λ) = (30, 4, 30) as compared
with (σ, μ, λ) = (30, 128, 30). The smaller mini-batch size not
only reduces the computational throughput at each learner,
but also increases the frequency of pushGradient and
pullWeights requests at the parameter server. In addition,
a small mini-batch size increases the frequency of weight
updates at the parameter server. Since in the Rudra-base
architecture (Section III-B) the learner does not proceed with
the computation on the next mini-batch until it has received the
updated weights, the traffic at the parameter server and the
more frequent weight updates can delay servicing the learner's
pullWeights request, potentially stalling the gradient
computation at the learner. Interestingly, all the configurations
along the μ = 4 contour show similar, if not better, test error
as the baseline. For these configurations, the average staleness
varies between 2 and 30. From this empirical observation, we
infer that a small mini-batch size per learner confers upon the
training algorithm a fairly high degree of immunity to stale
gradients.

The 1-softsync protocol shows (σ, μ, λ) tradeoff curves
(Figure 6(b)) that appear nearly identical to those observed
for the λ-softsync protocol. In this case, the average staleness
is 1 irrespective of the number of learners. Since the parameter
server waits for λ gradients to arrive before updating the
weights, there is a net reduction in the pullWeights traffic
at the parameter server (see Section III-B). As a result, the
1-softsync protocol avoids the degradation in runtime
observed in the λ-softsync protocol for the configuration with
μ = 4 and λ = 30. The distinction in terms of the runtime
performance between these two protocols becomes obvious
when comparing the speed-ups obtained for different
mini-batch sizes (Figure 7). For μ = 128, the 1-softsync and
λ-softsync protocols demonstrate similar speed-ups over the
baseline configuration for up to λ = 30 learners. In this case,
the communication between the learners and the parameter
server is sporadic enough to prevent the learners from waiting
on the parameter server for updated weights. However, as the
number of learners is increased beyond 30, the bottlenecks
at the parameter server are expected to diminish the speed-up
obtainable using the λ-softsync protocol. The effect of frequent
pushGradient and pullWeights requests at the parameter server
manifests clearly as the mini-batch size is
reduced to 4, in which case the λ-softsync protocol shows
a subdued speed-up compared with the 1-softsync protocol. In either
scenario, the hardsync protocol fares the worst in terms of
runtime performance improvement when scaling out to a large
number of learners. The upside of adopting the hardsync
protocol, however, is that it achieves substantially lower test
error, even for large mini-batch sizes.
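The difference between the two softsync variants can be illustrated with a toy parameter server. This is a sketch under stated assumptions, not Rudra's implementation: the class and function names are hypothetical, the weight is a single scalar, learners are served in a fixed round-robin order, and the server is assumed to update after collecting c = λ // n gradients (c = 1 for λ-softsync and c = λ for 1-softsync, which matches the two cases described in the text).

```python
from typing import List, Tuple


class SoftsyncServer:
    """Toy n-softsync parameter server holding one scalar weight."""

    def __init__(self, num_learners: int, n: int, lr: float = 0.001):
        # Assumption: update after c = num_learners // n gradients arrive.
        self.c = max(1, num_learners // n)
        self.lr = lr
        self.weight = 0.0
        self.timestamp = 0                 # weight updates applied so far
        self.pending: List[float] = []
        self.staleness_log: List[int] = []

    def pull_weights(self) -> Tuple[float, int]:
        return self.weight, self.timestamp

    def push_gradient(self, grad: float, grad_timestamp: int) -> None:
        # Staleness: updates applied since the learner pulled its weights.
        self.staleness_log.append(self.timestamp - grad_timestamp)
        self.pending.append(grad)
        if len(self.pending) == self.c:
            self.weight -= self.lr * sum(self.pending) / self.c
            self.timestamp += 1
            self.pending.clear()


def simulate(num_learners: int, n: int, rounds: int = 50) -> float:
    """Round-robin schedule; returns average staleness after the first round."""
    server = SoftsyncServer(num_learners, n)
    local = [server.pull_weights() for _ in range(num_learners)]
    for _ in range(rounds):
        for l in range(num_learners):
            w, ts = local[l]
            grad = 2.0 * (w - 1.0)         # gradient of (w - 1)^2, a stand-in loss
            server.push_gradient(grad, ts)
            local[l] = server.pull_weights()
    steady = server.staleness_log[num_learners:]
    return sum(steady) / len(steady)


print("lambda-softsync:", simulate(30, n=30))
print("1-softsync:     ", simulate(30, n=1))
```

In this idealized schedule the average staleness comes out close to n in both cases, and the 1-softsync server performs λ times fewer weight updates for the same number of pushed gradients, which is the source of the reduced server traffic discussed above. Real asynchronous runs will show a distribution of staleness values rather than a fixed number.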


C. μλ = constant

In the hardsync protocol, given a configuration with λ
learners and mini-batch size μ per learner, the parameter server
averages the λ gradients reported to it by the
learners. Using the gradient equations defined earlier, this average
over λ learners can be rewritten as an average over all μλ training
samples. The last step of that derivation is valid since each training
example (X_s, Y_s) is drawn independently from the training
set and also because the hardsync protocol ensures that all
the learners compute gradients on an identical set of weights,
i.e., the weights are the same ∀ l ∈ {1, 2, ..., λ}. Accordingly,
the configurations (σ, μ, λ) = (0, μ₀λ₀, 1) and
(σ, μ, λ) = (0, μ₀, λ₀) are equivalent from the perspective of
stochastic gradient descent optimization. In general, hardsync
configurations with the same μλ product are expected to give
nearly^ the same test error.
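The equivalence argued above can be sketched in a few lines. The notation is an assumption made for illustration: g_l denotes the mini-batch gradient reported by learner l, C is the per-sample cost, (X_{l,s}, Y_{l,s}) is the s-th sample of learner l, and θ_l are the weights seen by learner l.

```latex
\begin{align*}
\nabla\theta
  &= \frac{1}{\lambda}\sum_{l=1}^{\lambda} g_l
   = \frac{1}{\lambda}\sum_{l=1}^{\lambda}\,\frac{1}{\mu}\sum_{s=1}^{\mu}
     \frac{\partial C(X_{l,s},\, Y_{l,s})}{\partial \theta_l} \\
  &= \frac{1}{\mu\lambda}\sum_{s=1}^{\mu\lambda}
     \frac{\partial C(X_s,\, Y_s)}{\partial \theta}
     \qquad \text{(since } \theta_l = \theta \text{ for all } l \text{ under hardsync)}
\end{align*}
```

The right-hand side is the gradient a single learner would compute on one mini-batch of μλ independently drawn samples, which is why only the product μλ matters in this setting.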

In Table II, we report the test error at the end of 140
epochs for configurations with μλ = constant. Interestingly,


^ Small differences in the final test error achieved may arise due to the
inherent nondeterminism of random sampling in stochastic gradient descent
and the random initialization of the weights.


















μλ        σ     μ     λ     Test error   Training time (s)
≈ 128     1     4     30    18.09%       1573
          30    4     30    18.41%       2073
          18    8     18    18.92%       2488
          10    16    10    18.79%       3396
          4     32    4     18.82%       7776
          2     64    2     17.96%       13449
≈ 256     1     8     30    20.04%       1478
          30    8     30    19.65%       1509
          18    16    18    20.33%       2938
          10    32    10    20.82%       3518
          4     64    4     20.70%       6631
          2     128   2     19.52%       11797
          1     128   2     19.59%       11924
≈ 512     1     16    30    23.25%       1469
          30    16    30    22.14%       1502
          18    32    18    23.63%       2255
          10    64    10    24.08%       2683
          4     128   4     23.01%       7089
≈ 1024    1     32    30    27.16%       1299
          30    32    30    27.27%       1420
          18    64    18    28.31%       1713
          1     128   10    29.83%       2551
          10    128   10    29.90%       2626

TABLE II

Results on CIFAR10 benchmark: Test error at the end of 140 epochs and training time for (σ, μ, λ) configurations with μλ = constant.


we find that even for the n-softsync protocol, configurations
that maintain μλ = constant achieve comparable test errors. At
the same time, the test error turns out to be rather independent
of the staleness in the gradients for a given μλ product. For
instance, Table II shows that when μλ ≈ 128, but the (average)
gradient staleness is allowed to vary between 1 and 30, the test
error stays at ~18-19%. Although this result may seem
counterintuitive, a plausible explanation emerges when considering
the measurements shown in an earlier figure: our
implementation of the n-softsync protocol achieves an average
gradient staleness of n while bounding the maximum staleness
at 2n. Consequently, at any stage in the gradient descent
algorithm, the weights being used by the different learners
do not differ significantly and can be considered to
be approximately the same. The quality of this approximation
improves when each update

$$\theta^{(k)}(i+1) = \theta^{(k)}(i) - \alpha \nabla\theta^{(k)}(i)$$

creates only a small displacement in the weight space. This
can be controlled by suitably tuning the learning rate α.
Qualitatively, the learning rate should decrease as the staleness
in the system increases in order to reduce the divergence across
the weights seen by the learners. The learning rate modulation
introduced earlier achieves precisely this effect.
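The modulation described here amounts to dividing the base learning rate by the average gradient staleness before applying the SGD step. The following is a minimal sketch of that idea; the function names are hypothetical, and clamping the divisor at 1 (so that the hardsync case with σ = 0 keeps the base rate) is an assumption added to keep the helper well defined.

```python
from typing import List


def modulated_learning_rate(alpha0: float, avg_staleness: float) -> float:
    """Divide the base rate by the average staleness.

    Assumption: the divisor is clamped at 1 so that fresh gradients
    (staleness 0, as in hardsync) use the base rate unchanged.
    """
    return alpha0 / max(1.0, avg_staleness)


def sgd_step(theta: List[float], grad: List[float], alpha: float) -> List[float]:
    """One update theta(i+1) = theta(i) - alpha * grad(i)."""
    return [t - alpha * g for t, g in zip(theta, grad)]


alpha0 = 0.01                       # hypothetical base learning rate
theta = [1.0, 2.0]
grad = [0.5, -0.5]
for n in (1, 18, 30):               # n-softsync with average staleness n
    alpha = modulated_learning_rate(alpha0, n)
    print(n, alpha, sgd_step(theta, grad, alpha))
```

A larger staleness yields a smaller step, so the weights held by different learners drift apart more slowly between pulls.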

These results help define a principled approach for
distributed training of neural networks: the mini-batch size per
learner should be reduced as more learners are added to the
system in a way that keeps the μλ product constant. In addition,
the learning rate should be modulated to account for stale
gradients. In Table II, the 1-softsync (σ = 1) protocol invariably


σ     μ     λ     Synchronization protocol   Test error   Training time (s)
1     4     30    1-softsync                 18.09%       1573
0     8     30    Hardsync                   18.56%       1995
30    4     30    30-softsync                18.41%       2073
0     4     30    Hardsync                   18.15%       2235
18    8     18    18-softsync                18.92%       2488

TABLE III

Results on CIFAR10 benchmark: Top-5 best (σ, μ, λ) configurations that achieve a combination of low test error and small training time.


shows the smallest training time for any μλ. This is expected,
since the 1-softsync protocol minimizes the traffic at the
parameter server. Table II also shows that the test error
increases monotonically with the μλ product. These results
reveal the scalability limits under the constraint of preserving
the model accuracy. Since the smallest possible mini-batch size
is 1, the maximum number of learners is bounded. This upper
bound on the maximum number of learners can be relaxed
if an inferior model is acceptable. Alternatively, it may be
possible to reduce the test error for higher μλ by training
for more epochs. In such a scenario, adding more
learners to the system may give diminishing improvements
in the overall runtime. From a machine learning perspective,
this points to an interesting research direction on designing
optimization algorithms and learning strategies that perform
well with large mini-batch sizes.
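The rule of keeping the μλ product constant, and the bound it places on the number of learners, can be sketched as below. The helper names are hypothetical, and choosing the candidate μ whose product is closest to the target on a log scale is an assumption; it happens to reproduce the (μ, λ) pairs listed for μλ ≈ 128 in the table above.

```python
import math
from typing import Sequence

BATCH_SIZES = (4, 8, 16, 32, 64, 128)   # mini-batch sizes explored in the text


def batch_size_for(num_learners: int, target_product: int,
                   candidates: Sequence[int] = BATCH_SIZES) -> int:
    """Pick the mini-batch size whose product with the learner count is
    closest to the target, measured on a log scale."""
    return min(
        candidates,
        key=lambda mu: abs(math.log(mu * num_learners / target_product)),
    )


def max_learners(target_product: int, min_batch: int = 1) -> int:
    """Largest learner count that still allows mu >= min_batch."""
    return target_product // min_batch


for lam in (30, 18, 10, 4, 2):
    print(lam, batch_size_for(lam, target_product=128))
print("max learners at product 128:", max_learners(128))
```

In practice the bound is tighter than `max_learners` suggests, because very small mini-batches reduce GEMM throughput and increase traffic at the parameter server, as discussed earlier in this section.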


D. Summary of results on CIFAR10 benchmark

Table III summarizes the results obtained on the CIFAR10
dataset using the Rudra-base system architecture. As a
reference for comparison, the baseline configuration (σ, μ, λ) =
(0, 128, 1) achieves a test error of 17.9% and takes 22,392
seconds to finish training 140 epochs.


E. Results on ImageNet benchmark 

The large model size of the neural network used for the
ImageNet benchmark and the associated computational cost
of training this model prohibit an exhaustive state space
exploration. The baseline configuration (μ = 256, λ = 1)
takes 54 hours/epoch. Guided by the results of Section V-C,
we first consider a configuration with μ = 16, λ = 18 and
employ the Rudra-base architecture with the hardsync protocol
(base-hardsync). This configuration performs training at the
speed of ~330 minutes/epoch and achieves a top-5 error of
20.85%, matching the accuracy of the baseline configuration
(μ = 256, λ = 1, Section IV).

The synchronization overheads associated with the hardsync
protocol deteriorate the runtime performance, and the training
speed can be further improved by switching over to the
1-softsync protocol. Training using the 1-softsync protocol with
a mini-batch size of μ = 16 and 18 learners takes ~270
minutes/epoch, reaching a top-1 (top-5) error of 45.63%
(22.08%) by the end of 30 epochs (base-softsync). For this
particular benchmark, the training setup for the 1-softsync
protocol differs from the hardsync protocol in certain subtle














Configuration    Architecture   μ    λ    Synchronization   Validation error   Validation error   Training time
                                          protocol          (top-1)            (top-5)            (minutes/epoch)
base-hardsync    Rudra-base     16   18   Hardsync          44.35%             20.85%             330
base-softsync    Rudra-base     16   18   1-softsync        45.63%             22.08%             270
adv-softsync     Rudra-adv      4    54   1-softsync        46.09%             22.44%             212
adv*-softsync    Rudra-adv*     4    54   1-softsync        46.53%             23.38%             125

TABLE IV

Results on ImageNet benchmark: Validation error at the end of 30 epochs and training time per epoch for different configurations.


but important ways. First, we use an adaptive learning rate
method (AdaGrad) to improve the stability of SGD when
training using the 1-softsync protocol. Second, to improve
convergence, we adopt the strategy of warmstarting the
training procedure by initializing the network's weights from
a model trained with hardsync for 1 epoch.
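For readers unfamiliar with AdaGrad, the textbook per-coordinate update is sketched below together with a warm start. This is the standard formulation and not necessarily the exact variant used in these experiments; the class name, the learning rate, and the stand-in warm-start weights are all hypothetical.

```python
import math
from typing import List


class AdaGrad:
    """Textbook AdaGrad: per-coordinate step scaled by accumulated squared gradients."""

    def __init__(self, dim: int, lr: float = 0.01, eps: float = 1e-8):
        self.lr = lr
        self.eps = eps
        self.accum = [0.0] * dim      # running sum of squared gradients

    def step(self, theta: List[float], grad: List[float]) -> List[float]:
        new_theta = []
        for j, (t, g) in enumerate(zip(theta, grad)):
            self.accum[j] += g * g
            new_theta.append(t - self.lr * g / (math.sqrt(self.accum[j]) + self.eps))
        return new_theta


# Warm start: begin from weights produced by a short hardsync run
# (placeholder values here), then continue with AdaGrad updates.
warm_start_weights = [0.5, -0.3]
opt = AdaGrad(dim=2, lr=0.1)
theta = opt.step(warm_start_weights, [1.0, -2.0])
print(theta)
```

Because the first step of each coordinate has magnitude close to `lr` regardless of the gradient scale, the initial learning rate has a strong influence on the early trajectory.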

Further improvement in the runtime performance may be
obtained by adding more learners to the system. However,
as observed in the previous section, an increase in the number
of learners needs to be accompanied by a reduction in the
mini-batch size to prevent degradation in the accuracy of the
trained model. The combination of a large number of learners
and a small mini-batch size represents a scenario where the
Rudra-base architecture may suffer from a bottleneck at the
parameter server due to the frequent pushGradient and
pullWeights requests. These effects are expected to be
more pronounced for large models such as the one used for ImageNet. The
Rudra-adv architecture alleviates these bottlenecks, to some
extent, by implementing a parameter server group organized in
a tree structure. A configuration with λ = 54 learners, each processing
a mini-batch of size μ = 4, trains at ~212 minutes/epoch when using the
Rudra-adv architecture and the 1-softsync protocol (adv-softsync). As in
the case of Rudra-base, the average staleness in the gradients
is close to 1, and this configuration achieves a top-1 (top-5)
error of 46.09% (22.44%).

The Rudra-adv* architecture improves the runtime further
by preventing the computation at the learner from stalling
on the parameter server. However, this improvement in
performance comes at the cost of increasing the average
staleness in the gradients, which may deteriorate the quality of
the trained model. The previous configuration runs at ~125
minutes/epoch, but suffers an increase in the top-1 validation
error (46.53%) when using the Rudra-adv* architecture
(adv*-softsync). Table IV summarizes the results obtained for the
4 configurations discussed above. It is worth mentioning that
the configuration μ = 8, λ = 54 performs significantly
worse, producing a model that gives a top-1 error of > 50%
but trains at a speed of ~96 minutes/epoch. This supports our
observation that scaling out to a large number of learners must
be accompanied by reducing the mini-batch size per learner
so that the quality of the trained model can be preserved.
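The per-epoch timings quoted in this subsection translate into speed-ups over the 54 hours/epoch baseline as follows. The snippet is simple arithmetic on the reported numbers; the variable names are chosen for illustration.

```python
BASELINE_MIN_PER_EPOCH = 54 * 60   # 54 hours/epoch for (mu = 256, lambda = 1)

minutes_per_epoch = {
    "base-hardsync": 330,
    "base-softsync": 270,
    "adv-softsync": 212,
    "adv*-softsync": 125,
}

# Speed-up = baseline time per epoch / configuration time per epoch
speedup = {name: BASELINE_MIN_PER_EPOCH / m for name, m in minutes_per_epoch.items()}

for name, s in speedup.items():
    print(f"{name:14s} {s:5.1f}x")
```

The resulting speed-ups range from roughly 10x for base-hardsync to roughly 26x for adv*-softsync, with the faster configurations paying a small penalty in validation error.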

Figure 8 compares the evolution of the top-1 validation error
during training for the 4 different configurations summarized in
Table IV. The training speed follows the order adv*-softsync >
adv-softsync > base-softsync > base-hardsync. As a result,
adv*-softsync is the first configuration to hit the 48% validation
error mark. Configurations other than base-hardsync show
marginally higher validation error compared with the baseline.
As mentioned earlier, the experiments with the 1-softsync
protocol use AdaGrad to achieve stable convergence. It is well
documented that AdaGrad is sensitive to the initial
setting of the learning rates. We speculate that tuning the
initial learning rate can help recover the slight loss of accuracy
for the 1-softsync runs.

Fig. 8. Results on ImageNet benchmark: Error on the validation set as a
function of training time for the configurations listed in Table IV.

VI. Related Works 

To accelerate training of deep neural networks and handle
large datasets and model sizes, many researchers have
adopted GPU-based solutions, either single-GPU or multi-GPU.
GPUs provide enormous computing power and are
particularly suited for the matrix multiplications which are
the core of many deep learning implementations. However,
the relatively limited memory available on GPUs may restrict
their applicability to large model sizes.

DistBelief pioneered scale-out deep learning on CPUs.
DistBelief is built on tens of thousands of commodity PCs and
employs both model parallelism, via dividing a single model
between learners, and data parallelism, via model replication.
To reduce traffic to the parameter server, DistBelief shards
parameters over a parameter server group. Learners
asynchronously push gradients to and pull weights from the parameter
server. The frequency of communication can be tuned via the
npush and nfetch parameters.

More recently, Adam employs a system architecture similar
to that of DistBelief, while improving on DistBelief in two
respects: (1) better system tuning, e.g., a customized concurrent
memory allocator, a better linear algebra library implementation,
and passing activation and error gradient vectors instead of the
weight updates; and (2) leveraging recent improvements in
machine learning, in particular convolutional neural networks,
to achieve better model accuracy.

In any parameter server based deep learning system,
staleness will negatively impact model accuracy. Orthogonal to
the system design, many researchers have proposed solutions
to counter staleness in the system, such as bounding the
staleness in the system or changing the optimization objective
function, as in elastic averaging SGD [25]. In this paper, we
empirically study how staleness affects the model accuracy and
discover the heuristic that a smaller mini-batch size can
effectively counter system staleness. In our experiments, we derive
this heuristic from a small problem size (e.g., CIFAR10) and
we find that it is applicable even to a much larger problem
size (e.g., ImageNet). Our finding coincides with a very
recent theoretical paper, in which the authors prove that in
order to achieve linear speedup using an asynchronous protocol
while maintaining good model accuracy, one needs to increase
the number of weight updates conducted at the parameter
server. In our system, this theoretical finding is equivalent to
keeping a constant number of training epochs while reducing the
mini-batch size. To the best of our knowledge, our work is the
first systematic study of the tradeoff between model accuracy
and runtime performance for distributed deep learning.

VII. Conclusion 

In this paper, we empirically studied the interplay of
hyperparameter tuning and scale-out in three protocols for
communicating model weights in asynchronous stochastic gradient
descent. We divide the learning rate by the average staleness of
gradients, resulting in faster convergence and lower test error.
Our experiments show that the 1-softsync protocol (in which
the parameter server accumulates λ gradients before updating
the weights) minimizes gradient staleness and achieves the
lowest runtime for a given test error. We found that to maintain
model accuracy, it is necessary to reduce the mini-batch
size as the number of learners is increased. This suggests an
upper limit on the level of parallelism that can be exploited
for a given model, and consequently a need for algorithms that
permit training over larger batch sizes.